AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.45)

Ki, Dohyeong, Guntuboyina, Adityanand

What Functions Does XGBoost Learn?

arXiv.org Machine LearningJan-12-2026

This paper establishes a rigorous theoretical foundation for the function class implicitly learned by XGBoost, bridging the gap between its empirical success and our theoretical understanding. We introduce an infinite-dimensional function class $\mathcal{F}^{d, s}_{\infty-\text{ST}}$ that extends finite ensembles of bounded-depth regression trees, together with a complexity measure $V^{d, s}_{\infty-\text{XGB}}(\cdot)$ that generalizes the $L^1$ regularization penalty used in XGBoost. We show that every optimizer of the XGBoost objective is also an optimizer of an equivalent penalized regression problem over $\mathcal{F}^{d, s}_{\infty-\text{ST}}$ with penalty $V^{d, s}_{\infty-\text{XGB}}(\cdot)$, providing an interpretation of XGBoost as implicitly targeting a broader function class. We also develop a smoothness-based interpretation of $\mathcal{F}^{d, s}_{\infty-\text{ST}}$ and $V^{d, s}_{\infty-\text{XGB}}(\cdot)$ in terms of Hardy--Krause variation. We prove that the least squares estimator over $\{f \in \mathcal{F}^{d, s}_{\infty-\text{ST}}: V^{d, s}_{\infty-\text{XGB}}(f) \le V\}$ achieves a nearly minimax-optimal rate of convergence $n^{-2/3} (\log n)^{4(\min(s, d) - 1)/3}$, thereby avoiding the curse of dimensionality. Our results provide the first rigorous characterization of the function space underlying XGBoost, clarify its connection to classical notions of variation, and identify an important open problem: whether the XGBoost algorithm itself achieves minimax optimality over this class.

artificial intelligence, machine learning, variation, (19 more...)

2601.05444

Country: North America > United States (0.92)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Neural Information Processing SystemsNov-21-2025, 14:41:42 GMT

A Simple Practical Accelerated Method for Finite Sums

name change, proceedings, simple practical accelerated method, (2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.45)

Alberto Bietti, Julien Mairal

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Neural Information Processing SystemsNov-21-2025, 11:28:39 GMT

In this paper, we are interested in a variant of (1) where random perturbations of data are introduced, which is a common scenario in machine learning.

algorithm, artificial intelligence, machine learning, (15 more...)

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Neural Information Processing SystemsJan-24-2025, 02:53:25 GMT

Reviews: An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums

Motivated by the APCG method for empirical risk minimization, this paper proposed an an accelerated decentralized stochastic algorithm for finite sums. An augmented communication graph is proposed such that the original constrained optimization problem can be transformed to its dual formulation, which can be solved by stochastic coordinate descent. The proposed method employs randomized pairwise communication and stochastic computation. An adaptive sampling scheme for selecting edge is introduced, which is analogous to importance sampling in the literature. The theoretical analysis shows that the proposed algorithm achieves an optimal linear convergence rate for finite-sums, and the time complexity is better than the existing results.

accelerated decentralized stochastic proximal algorithm, finite sum, local sample size, (3 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.58)

Neural Information Processing SystemsJan-24-2025, 02:53:14 GMT

Reviews: An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums

Optimization under communication constraints is an important area of machine learning and there is still much to be gained from rigorous research in this area. Statistical literature does not concern itself much with inter- and intra-processor communication; nor does optimization (operation research) literature. Machine learning is a natural community for this research to be advanced. This paper uses a randomization scheme for communicating information between nodes to achieve impressive learning rates. The authors derive theoretical bounds on the estimation error induced by the communication constraint and show that they achieve improvements over state-of-the-art approaches with some empirical experiments.

accelerated decentralized stochastic proximal algorithm, communication constraint, finite sum, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.87)

Alberto Bietti, Julien Mairal

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Neural Information Processing SystemsOct-3-2024, 19:10:31 GMT

Stochastic optimization algorithms with variance reduction have proven successful for minimizing large finite sums of functions. Unfortunately, these techniques are unable to deal with stochastic perturbations of input data, induced for example by data augmentation. In such cases, the objective is no longer a finite sum, and the main candidate for optimization is the stochastic gradient descent method (SGD). In this paper, we introduce a variance reduction approach for these settings when the objective is composite and strongly convex. The convergence rate outperforms SGD with a typically much smaller constant factor, which depends on the variance of gradient estimates only due to perturbations on a single example.

algorithm, dataset, perturbation, (13 more...)

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Chzhen, Evgenii, Schechtman, Sholom

SignSVRG: fixing SignSGD via variance reduction

arXiv.org Machine LearningMay-22-2023

We consider the problem of unconstrained minimization of finite sums of functions. We propose a simple, yet, practical way to incorporate variance reduction techniques into SignSGD, guaranteeing convergence that is similar to the full sign gradient descent. The core idea is first instantiated on the problem of minimizing sums of convex and Lipschitz functions and is then extended to the smooth case via variance reduction. Our analysis is elementary and much simpler than the typical proof for variance reduction methods. We show that for smooth functions our method gives $\mathcal{O}(1 / \sqrt{T})$ rate for expected norm of the gradient and $\mathcal{O}(1/T)$ rate in the case of smooth convex functions, recovering convergence results of deterministic methods, while preserving computational advantages of SignSGD.

algorithm, artificial intelligence, machine learning, (14 more...)

2305.13187

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

arXiv.org Machine LearningJan-27-2021

A Note on the Representation Power of GHHs

Lu, Zhou

In this note we prove a sharp lower bound on the necessary number of nestings of nested absolute-value functions of generalized hinging hyperplanes (GHH) to represent arbitrary CPWL functions. Previous upper bound states that $n+1$ nestings is sufficient for GHH to achieve universal representation power, but the corresponding lower bound was unknown. We prove that $n$ nestings is necessary for universal representation power, which provides an almost tight lower bound. We also show that one-hidden-layer neural networks don't have universal approximation power over the whole domain. The analysis is based on a key lemma showing that any finite sum of periodic functions is either non-integrable or the zero function, which might be of independent interest.

cpwl function, neural network, representation power, (14 more...)

2101.11286

Genre: Research Report (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Machine LearningNov-8-2020

Structured Logconcave Sampling with a Restricted Gaussian Oracle

Lee, Yin Tat, Shen, Ruoqi, Tian, Kevin

We give algorithms for sampling several structured logconcave families to high accuracy. We further develop a reduction framework, inspired by proximal point methods in convex optimization, which bootstraps samplers for regularized densities to improve dependences on problem conditioning. A key ingredient in our framework is the notion of a "restricted Gaussian oracle" (RGO) for $g: \mathbb{R}^d \rightarrow \mathbb{R}$, which is a sampler for distributions whose negative log-likelihood sums a quadratic and $g$. By combining our reduction framework with our new samplers, we obtain the following bounds for sampling structured distributions to total variation distance $\epsilon$. For composite densities $\exp(-f(x) - g(x))$, where $f$ has condition number $\kappa$ and convex (but possibly non-smooth) $g$ admits an RGO, we obtain a mixing time of $O(\kappa d \log^3\frac{\kappa d}{\epsilon})$, matching the state-of-the-art non-composite bound; no composite samplers with better mixing than general-purpose logconcave samplers were previously known. For logconcave finite sums $\exp(-F(x))$, where $F(x) = \frac{1}{n}\sum_{i \in [n]} f_i(x)$ has condition number $\kappa$, we give a sampler querying $\widetilde{O}(n + \kappa\max(d, \sqrt{nd}))$ gradient oracles to $\{f_i\}_{i \in [n]}$; no high-accuracy samplers with nontrivial gradient query complexity were previously known. For densities with condition number $\kappa$, we give an algorithm obtaining mixing time $O(\kappa d \log^2\frac{\kappa d}{\epsilon})$, improving the prior state-of-the-art by a logarithmic factor with a significantly simpler analysis; we also show a zeroth-order algorithm attains the same query complexity.

algorithm, exp, oracle, (15 more...)

2010.03106

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Montenegro (0.04)
(14 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)